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Abstract. Phylogenetic diversity is a measure for describing how much of 
an evolutionary tree is spanned by a subset of species. If one applies this 
to the (unknown) subset of current species that will still be present at some 
future time, then this 'future phylogenetic diversity' provides a measure of 
the impact of various extinction scenarios in biodiversity conservation. In this 
paper we study the distribution of future phylogenetic diversity under a simple 
model of extinction (a generalized 'field of bullets' model). We show that the 
distribution of future phylogenetic diversity converges to a normal distribution 
as the number of species grows (under mild conditions, which are necessary). 
We also describe an algorithm to compute the distribution efficiently, provided 
the edge lengths are integral, and briefly outline the significance of our findings 
for biodiversity conservation. 
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1. Introduction 

The current rapid rate of extinction of many diverse species has focused attention 
on predicting the loss of future biodiversity. There are numerous ways to measure 
the 'biodiversity' of a group of species, and one which recognises the evolutionary 
linkages between taxa (for example, species) is phylogenetic diversity ([3], 0], [H]). 
Briefly, given a subset of taxa, the PD (phylogenetic diversity) score of that subset 
is the sum of the lengths of the edges of the evolutionary tree that connects this 
subset (formal definitions are given shortly). Here the 'length' of an edge may refer 
to the amount of genetic change on that edge, its temporal duration or perhaps 
other features (such as morphological diversity). 

Under the simplest models of speciation, each taxon has the same probability of 
surviving until some future time, and the survival of taxa are treated as independent 
events; this is a simple type of 'field of bullets' model ([5], [T2], [II])- This model 
is quite restrictive ([11]) and a more realistic extension allows each species to have 
its own survival probability - this is the model we study in this paper. Under this 
model, we would like to be able to predict the PD score of the set of taxa that 
survive. This 'future PD' is a random variable with a well-defined distribution, but 
to date, most attention has focused on just its mean (that is, the expected PD score 
of the species that survive). For example, the 'Noah's Ark problem' ([61 [T51 ITU]) 
attempts to maximize expected future PD by allocating resources that increase 
the survival probabilities in a constrained way. Clearly, one could consider other 
properties of the distribution of future PD - for example the probability (let us 
call it the PLq value) that future PD is less than some critical lower limit (Lq). 
Given different conservation strategies, we may wish to maximize expected PD or 
minimize the PLq value. A natural question is how are these two quantities related? 

To address these sorts of questions, we need to know the full distribution of fu- 
ture PD. In this paper, we show that for large trees, future PD is (asymptotically) 
normally distributed. Given the increasing trend in biology of constructing and 
analysing phylogenetic trees that contain large numbers of species (10 2 — 10 3 ), we 
see this result as timely. Our work was also motivated by the suggestive form of 
distributions obtained by simulating future PD by sampling 12-leaf subtrees ran- 
domly from 64- leaf trees from Nee and May ([9], see also [H]). To formally prove 
the normal limit law requires some care, as future PD is not a sum of indepen- 
dent random variables (even though the survival events for the taxa at the leaves 
are treated independently); consequently, the usual central limit theory does not 
immediately apply. 

This limit law has some useful consequences for applications. For example, it 
means that for a large tree, the PLq value can be estimated by the area under a 
normal curve 

t0 th e left of L r EfPD1 - In particular , we see that the relation between 

y/y^[PD] 

the PL value and expected future PD (E[P_D]) involves scaling by the standard 
deviation of future PD (so strategies that aim to maximize expected future PD may 
not necessarily minimize the PLq value). 
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Our normal distribution result is asymptotic - that is, it holds for large trees. 
However, it is also useful to have techniques for calculating the exact PD distri- 
bution on any given tree. In Section [31 we show how this may be achieved by a 
polynomial time algorithm under the mild assumption that each edge length is an 
integer multiple of some fixed length. In Section |4l we show how our results can be 
easily modified to handle an 'unrooted' form of PD that has also been considered 
in the literature. 

1.1. Definitions and preliminaries. Throughout this paper X will denote a set 
of taxa (for example, different species, different genera or populations of the same 
species) and X' will denote a subset of X. A rooted phylogenetic X-tree is a rooted 
tree in which (i) all edges are oriented away from the root, (ii) X is the set of 
leaves (vertices of the tree with no outgoing edges) and (iii) every vertex except 
the leaves (and also possibly the root) has at least two out-going edges (allowing 
the root to have just one outgoing arc will be useful later). In systematic biology, 
these trees are used to represent evolutionary development of the set X of taxa 
from their common ancestor (the root of the tree), and the orientation of the edges 
corresponds to temporal ordering. Given a rooted phylogenetic X-tree T, we let 
E(T) denote the set of edges, and Ep(T) denote the set of pendant edges (edges 
that are incident with a leaf). 

Suppose we have a rooted phylogenetic X-tree T and a map A that assigns a 
non-negative real- valued length A e to each edge e of T. Given the pair (T, A) and a 
subset X' of X, the phylogenetic diversity of X' ', denoted PD(q-.\)(X') - or, more 
briefly, PD(X') - is the sum of the A e values of all edges that lie on at least one 
path between an element of X' and the root of T. In the (generalized) field of 




Figure 1 . If only the taxa marked * in the tree on the left survive 
then the future phylogenetic diversity is the sum of the lengths of 
the solid edges in the tree on the right. 

bullets model (g-FOB), we have a triple (T, X,p) where T is a rooted phylogenetic 
X— tree, A is an edge length assignment map, and p is a map that assigns to each 
leaf i£la probability pi. Construct a random set X' by assigning each element 
i of X to X' independently with probability pi. In biodiversity conservation we 
regard X' as the set of taxa that will still exist (that is, not be extinct) at some 
time t in the future; accordingly, we call pi the survival probability of i. 

Considering the random variable ip = ifT = PD(T,\){X'), which is the phyloge- 
netic diversity of the random subset X' of X (consisting of those taxa that 'survive') 
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according to the process just described, we call ip future phylogenetic diversity. An 
example of this process is shown in Fig. 1. 

Note that in the g-FOB model, we can write 

(1) Cf = J2^Y e , 

e 

where Y e is the binary random variable which takes the value 1 if e lies on at least 
one path between an element of X' and the root of T, and which is otherwise. 
Moreover, 

(2) P[Y e = l] = l- [] (i-J*). 

tec(e) 

where C(e) is the set of elements of X that are separated from the root of T by e. 
Consequently, if we let 

p e :=p[y e = i] = i- n (i-Pi)' 

iec(e) 

then 

(3) E[^]=^A e P e . 



Equation {1} suggests that for large trees, ip might be normally distributed, as 
it will be sum of many random variables (a normal distribution is also suggested 
by simulations described in [9J [14]). However, the random variables (X e Y e ) are not 
identically distributed and, more importantly, they are not independent. Therefore 
a straightforward application of the (usual) central limit theorem seems problem- 
atic. We show that under two mild restrictions, a normal law can be established 
for large trees. Moreover, neither of these two mild restrictions can be lifted (we 
exhibit a counter-example to a normal law in both cases). 

Since a normal distribution is determined once we know both its mean and vari- 
ance, it is useful to have equations for calculating both these quantities. Equation 
(|3| provides a simple expression for the mean, and we now present an expression 
for the variance that is also easy to compute. Given two distinct edges of T, we 
write e <r / if the path from the root of T to / includes edge e (or, equivalcntly, 
C(J) C C(e)). 

Lemma 1.1. 

Var[<p]='£\ 2 e P e (l-P e ) + 2 E A e A/P/(l-P 8 ). 

e (e,f):e< T f 

Proof. ^From Equation ([1]) we have: 

VarM = Cov[<p, <p] = ^ \ e \ } Cov[F e , Y f }. 

The covariance of Y e and Yf is 

Cov[r e , Y f ] = E[Y e Yf] - E[Y e ]E[Yf] = V[Y e = l,Y f = 1} - V[Y e = l]P[Y f = 1]. 
Now, we have the following cases: 
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(1) e^ / and neither e <t / nor / <x e. In this case, the subtree of T with 
root edge e and the subtree of T with root edge / do not have any leaves 
in common, and so Y e and Yt are independent. Thus, Cov[K e , Y/] = 0. 

(2) e <r f ■ In this case, C(f) C C(e) and so the survival of any taxon in 
C(/) implies the survival of a taxon in C(e); that is, Yf = 1 implies Y e = 1 
and we have Cov[Y* e , Y/] = P[Y) = 1] - P[Y e = 1]P[Y) = 1] = - P e ). 

(3) / <r e. This is analogous to case (2) (and, together with case (1), ex- 
plains the factor of 2 in the expression on the right-hand side of our formula 
for Var [<p]). 

(4) e = f. This case gives Cov[Y e , Y>] = P[Y e = l](l-P[Y e = 1]) = P e (l-P e ) 
(and corresponds to the first term on the right hand-side of our formula for 
Varfo]). 



By considering these cases for Cov[Y e , Y/], we obtain the result claimed. □ 

A consequence of this lemma is the following lower bound on the variance of 
future PD which will be useful later. 

Corollary 1.2. Consider the g-FOB model on (T, A,p). Then, 

Var[<p]> ]T X 2 e P e (l-P e ). 

e£E P (T) 

Proof. Notice that all the terms in the summation expression for Var[y>] in Lemma fl.ll 
are non-negative, and so a lower bound on Var[y] is obtained by summing over 
those pairs (e, /) for which e = / is a pendant edge of T. This gives the claimed 
bound. □ 



2. Asymptotic normality of future phylogenetic diversity under the 

g-fob model 



Consider a sequence of such rooted phylogenetic trees: 

Ti,T 2 , . . . ,T n , . . . 

where T n has a leaf label set X — {1, . . . , n}. Furthermore, suppose that for each 
tree we have an associated edge length function A = A^™) and a survival probability 
function p — p^ 71 ' . For the sequence of g-FOB models (T n ,\^\p^), we impose 
the following conditions (where Ep(T n ) is the set of pendant edges of T n ): 

(CI) For some e > and for each n, we have: 

^ (n) ^ -i 
e < Pi < 1 - e, 

for alH € {1, . . . , n} except for at most An a values of i, where A, a > are 
constants, with a < -|. 
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(C2) Let L(n) = max{Ae" : e € E(T n )}. Then, for each n, we have: 

e<EEp(T n ) 

for some constants B > 0, j3 > 2a. 

Remarks concerning conditions (CI), (C2). 

Condition (CI) simply says that the survival of most taxa is neither (arbitrarily 
close to) certain nor impossible. The term An a provides the flexibility to allow for 
some of the taxa to have a survival probability that is very close to, or even equal 
to, or 1. 

Condition (C2) says, roughly speaking, that the pendant edges are, on average, 
not too short in relation to the longest edge in the tree. This is relevant for evolu- 
tionary biology, as it follows that for trees generated by a constant speciation rate 
'pure birth' model (see, for example, [2]) condition (C2) holds in expectation (for 
any a € (0, ^)). A more formal statement of this claim, and its proof, is given in 
the Appendix. 

Note that if condition (C2) holds for a value (3 > then, is at most 1, since 
the terms in the summation expression in (C2) are all at most 1 and there are 0{n) 
of them. 

□ 

Next, we state our main theorem, which describes the asymptotic normality of 
future phylogenetic diversity ip n — Lpq- n . Since phylogenetic trees often contain a 
large number of taxa, the result allows one to approximate the distribution of future 
phylogenetic diversity with a normal distribution. 

Theorem 2.1. Under conditions (CI) and (C2), (tp n — E [ip n ] ) / \J Var[<^ n ] converges 
in distribution to N(0, 1) as n — » oo, where N(0, 1) denotes a standard normally 
distributed random variable. 

We pause to note that one cannot drop either condition (CI) or (C2) in Theo- 

rem l2.ll It is clear that dropping (CI) is problematic (for example, set p^ € {0, 1} 
for all i which leads to a degenerate distribution); as for (C2) the following example 
shows that we require (3 to be strictly positive. 

Example: Condition (C2) cannot be removed 

Consider a tree T n with n leaves. Leaves 1, . . . , n — 1 have incident edges that 
each have length , 1 , and all these edges are incident with a vertex that is ad- 

° y/n—l ° 

jacent to the root by an edge of length 1. Leaf n has edge length 1 (see Fig. [5]). 
Consider a sequence of g-FOB models with — s for alH, n, where s is any num- 
ber strictly between and 1. Then <p n = — =yl n + B n + C n where — =A„ is the 
contribution to tp n of the n — 1 edges that are incident with leaves 1, . . . ,n — 1; B n is 
the contribution to ip n of the edge that connects these n — 1 edges to the root of T n 
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and C n is the contribution to tp n of the edge incident with leaf n. Notice that A n is 
a sum of rt — 1 i.i.d. binary (0, 1) random variables, each of which takes the value 1 
with probability s, and C n is a binary random variable which takes the value 1 with 
probability s. Consequently, the variance of -j==A n equals s(l — s), the same as 
the variance of C„ . Moreover, B n converges in probability to 1, and C n is indepen- 
dent of A n and B n . Consequently, Var[(^ n ] — * 2s(l — s) as n — ► oo. Furthermore, 

1 A n -E[ 1 _ i A n ] 

by the standard central limit theorem, , v " converges in distribu- 

tion to N(Q, h) (a normal random variable with mean and variance ^). Thus, 
(ip n — E[i/3„])/yVar[^J converges to the random variable N(Q, |) + W where W is 
independent of N(0, \) and takes the value ) w ^ n probability s and takes 

the value ^j2~(i ) w ^ n probability 1 — s. In particular, (cp n — E[<^ n ])/ -^/Var[<^ n ] 
does not converge in distribution to AT(0, 1). Notice that in this example, (CI) is 
satisfied, but (C2) fails since J2e&E P (T n ) 




Figure 2 . A tree for which future phylogenetic diversity does not 
become normally distributed as n grows. 



□ 



We now provide a brief, informal outline of the approach we use to prove Theo- 
rem l2.ll The main idea is to decompose T n into a 'central core' and a large number 
of 'moderately small' pendant subtrees. Each edge in the central core separates the 
root from enough leaves so that we can be very sure that at least one of these leaves 
survives - consequently the combined PD-contribution of this central core converges 
in probability to a fixed (non-random) function of n. Regarding the pendant sub- 
trees, their contributions to the PD score are independent and although they are 
not identically distributed random variables, their combined variance grows suf- 
ficiently quickly that we can establish a normal law for their sum by a standard 
central limit theorem. 



Proof of Theorem \2.1\ We first note that it is sufficient to establish Theorem 12.11 
under (CI) and the seemingly stronger condition: 

(C2*) L(n) = 1, and EeeBpfT^O^) 2 > BvP for constants B > 0,(3 > 2a. 

To see why, suppose we have established Theorem 12.11 under (CI), (C2*). For 
a sequence T n (with associated maps \( n \ pW) satisfying (CI), (C2), let [if^ = 
L(n)~ 1 \e^ for each edge e of T n and each n. Note that, by Equation (TT]), the 
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normalized ip score (namely (ip n — E[<p n ] ) / y/Vax [ip n ] ) ) for (T n , jj^ n \p^ n ') equals the 
normalized cp score for (T„, \^ n \p^) and that (T n , M (n) ,p (n) ) satisfies (C2*). Thus 
we will henceforth assume conditions (CI) and (C2*). 

Next, we make a notational simplification: for the remainder of the proof, we 
will write Ae as A e and pV 1 ' as pi (but respecting in the proof that these quantities 
depend on n). Also, for a sequence of random variables (Y n ), we write Y n — > a to 

denote that Y n converges in probability to a constant a, and Y n — > Y to denote 
that Y n converges in distribution to a random variable Y. 

Since > 2a, we may select a value 7 with a < 7 < (3/2, and set f(ri) := n 1 . 
We partition the edges of T n into two classes E™ and E% and we define a third class 
E12 Q E" as follows: Let n e denote the number of leaves of T n that are separated 
from the root by e. Then set: 

• E™: edges e of T n with n e < f(n); 

• EV,: edges e of T n with n e > /(n); 

• £7" 2 : edges e€ £" such that e is adjacent to an edge f £ E%. 
For an edge e € E 1 ^ of T n , we make the following definitions: 

• t e denotes the subtree of T n consisting of edge e and all other edges of T n 
that are separated from the root by e. 

• ip™ denotes the future phylogenetic diversity of t e , under the probabilistic 
model described above. 

See Fig. 3 for a schematic summary of these concepts. 




< f(n) < f(n) < /(n) 



Figure 3. A representation of the decomposition of T n in the 
proof of Theorem 12. f I 

For ip n , Equation |T]) gives 

(4) <f n = X eY e + X eY e = J2 <Pc + Yl X - Y - 

e£E™ e£E% e£E^ 2 eSBJ 



DISTRIBUTION OF PHYLOGENETIC DIVERSITY 



9 



Let 

A„ = ^2 A e , Z n = ^ Ve> and R n = ^2 A e (l - Y e ). 

With this notation, we can re-write ((4]) as 



(5) ip n = A„ + Z n - R n . 



Lemma 2.2. R n A 0. 



Proof. Since Var[-R„] = E[P 2 ] -E[P„] 2 and E[P 2 ] > E[P„] 2 , it is sufficient to show 
that E[P 2 ] — > (the claim that i? n — > then follows by Chebyshev's inequality) 
We have P„ = J2 e eE n A e (l — ^e) an d so 



p, 2 = £ A e A/(i - y e )(i - yy) < |^| £(i-y e ), 

e,/GB 2 " eGBJ 

since A e , A/ < 1 by (C2*), and (1 -Y f )<l for all / € JB£. Thus, 



(6) E[P 2 ] < |P™| 2 • max{P[y = 0] : e G E%}. 

Now, for any edge e G -BJ there are at least n 1 — An a elements i of C(e) for which 
K > e (by (CI)) and thus 

p[y e = 0] < (l-e) n7 - An °. 
Since |J^| < 2n, Equation ([6|) and the inequality a < 7 gives 

E[i? 2 ] < 4n 2 • (1 - e ) nl - Ana as n -»• 00, 
as required. □ 
Lemma 2.3. Under conditions (CI) and (Clt), we have 

(A(")) 2 P e (l-P e )>Se 2 (l + (l))^, 

eG-Ep (T„) 

where o(l) denotes a term that tends to as n 00. 

Proof. Let J7„ be the set of those pendant edges e of 7^ for which the leaf incident 
with e has its survival probability in the interval [e, 1 — e], and let V n denote the 
set of the remaining pendant edges of T n . Clearly, 

(7) £ (A(")) 2 P e (l-P e )> e 2 ^(A(")) 2 , 

eS_Bp(T„) e£(7„ 

and by (C2*) we have 

(8) Bn?< ( X i n) ) 2 < E( A ^) 2 + I^I 

eEE P (T n ) e£l7„ 
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where the last term (|T4,|) is an upper bound on J2 e ev (^e ) 2 by virtue of the 
bound |Ai n) | < 1 (by (C2*)). Since \V n \ < An a , Equations © and © give 

(Ai n) ) 2 Pe(l - P e ) > e 2 {Bn' 3 - An a ) = Be 2 {l + o{l))n f} . 

□ 

Lemma 2.4. The random variable ip n = (Z n - E[Z n ])/ Var[Z n ] N(0, 1). 

Proof. We can apply a version of the central limit theorem for double arrays of ran- 
dom variables. The required theorem can be found in [13] and states the following. 
For each n, let X n i, . . . , X nr be r = r(n) independent random variables with finite 
pth moments for some p > 2. Let 



j 3 

If 

(9) fl-f/a E[\X nj - nx n] ]\P] > as n > oo , 

3 

then W n = (J2jX n j — A n )/^fb\ N(0,1). We apply this theorem by taking 
{X n i, . . . , X nr } — {(p™ : e £ E™ 2 }, since the random variables {</?": e € E" 2 } are 
clearly independent. With our notation Z n = X^eeB" fei we have A n = E[Z n ], 
B n = Var[Z„] and W n = ip n . Thus, we only need to verify condition §9§ in order 
to establish Lemma HOI 

By Corollary 1 1.2[ we have: 

Var[#]> Yl ^fW-Pf)' 

feE P (t c ) 

This lower bound and the independence of {<p™ : e £ E™ 2 }, implies: 

B n = Var[Z„] = £ Var[^ 1 ] > £ £ X 2 f P f (l-P f ) 

eeE™ 2 e£E» feE P (t e ) 

Consequently, by Lemma l23l and the fact that every pendant edge occurs in Ep(t e ) 
for some e £ E™ 2 we obtain, 

(10) B n > Be 2 {l + o{l))vP. 
Consider now the absolute central moments in ©. We have 

nx nj nx n3 ]\ p ] = em n^w] < li, 

where L e is the sum of the lengths of the edges of t e . Since t e has less than 2n e 
edges, and the edge lengths are bounded from above by 1 (under (C2*)) and e £ E™ 2 
implies n e < f(n), we obtain L e < 2n e < 2f(n). Now we have 



(11) 



EM-n<P n e]\ P \ <2 p f(n) p . 
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Combining the bounds (flQ|) and (fTTj) . and noting that \Ef 2 \ < 2n and /(ra) = n 1 
we obtain: 



£ HM-E[tf]H< 



^ -ure "Lr eJ , J - ( jBe 2( 1+0 ( 1 )))p/2 n /3p/2 

e£.Ep 2 

for some constant C(p) > independent of n. Now, since 7 < /3/2, the exponent 
of n in the obtained upper bound is negative for any p > {(3/2 — 7) _1 . Since there 
are some p > 2 satisfying this inequality and consequently satisfying condition ([5]) , 
the proof of Lemma \2. 41 is complete. □ 



We return to the proof of Theorem 12.11 Using Equation ([5]) and the notation of 
Lemma [2T4l we get 



<p n -E[<p n ] = K + Z n - R n - (A n + E[Z n ] - E[R n }) 
v/Var [ip n ] y/Yax\cpn\ 
= C n ijj n + D n 



where 



y/Vai[Z n ] R n -E[R n 
C„ = . . and D n = -- 



yVarf^J " yVarf^J 

By Lemma \2.2\ and the fact that Var[y>„] does not converge to (by Corollary 1 1.2 
Lemma l2~3l and condition (C2*)), we have: 



(12) D n A 0. 

Moreover, by ©, Var[tp„] = Var[Z„] + Var[i?„] - 2 Cov[Z„, R n ], so that 



2 VarK] _ yVarp^] 

Var[Z„] P ^[Z n ~r 

where p is the correlation coefficient of R n and Z n . Now, by Lemma 12.21 we have 
hrttn—Kx, Var[-R„] = 0. Thus, since Var[Z„] is bounded away from (by (|10p ). and 
p € [—1,1] we have: 

(13) lim C n = 1. 

n — >oo 

To complete the proof of Theorem 12. 1[ we apply Slutsky's Theorem [Tj which 

p p 

states that if X n , Y n , W n are sequences of random variables, and X n — > a, Y n — > b, 

(where a, b are constants) and W n — > (for some random variable W) then 

X n W n + Y n — ► aVK + 6. In our setting, we will take X n = C n , Y n = D n , W n = ip n , 
and W — iV(0, 1) (the standard normal random variable). The condition that 

ipn N(0, 1) was established in Lemma [2^41 and the conditions C„ 1, D n —> 
were established in (|13|) and (fT2| (note that the convergence of a sequence of real 
numbers in (fl~3| is just a special case of convergence in probability). Thus, 

(<p n - M[ip n ])/y/Vai[ip n ] = C„Vn + D n ^ N(0, 1), 
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which completes the proof of Theorem 12.11 



□ 



3. Computing the PD distribution 



In this section we describe an algorithm to calculate the distribution of ipj- 
efficiently under the g-FOB model. An approximate distribution could also be 
obtained by simulation, but the approach we present here allows us to derive the 
exact distribution of tp-r- Note that we do not require conditions (CI) or (C2) 
in this section. We make the simplifying assumption that the edge lengths are 
non-negative integer-valued, which implies that ifT can only have values in the set 
{0, 1, . . . , L}, where L = PD(X) = Y^ e A e . This assumption is not problematic in 
practice, as we can rescale all the edge lengths so that they are (arbitrarily close 
to) integer multiples of some small value. 

We also assume that the input tree is such that the root has one outgoing edge 
and all other non-leaf vertices have exactly two outgoing edges. This assumption 
does not affect the generality of our method, as any tree can be modified to satisfy 
it, without changing the distribution for ipx'- one can resolve multifurcations arbi- 
trarily and possibly insert an edge below the root, always assigning length to the 
newly introduced edges. 

Consistent with the notation used before, <p e denotes the contribution to tpx that 
comes from e and the edges separated from the root by e. Then, for any edge e 
and integer x, define 



Also recall that P e = P[Y e = 1]. 

Clearly, if e is the only edge attached to the root of T, then f e and P e are all 
that is needed to derive the distribution of (fq-: simply observe that 



F[ip T = x}= Y[ Ve =x,Y e = l] + F[tp e = x,Y e = 0} = f e (x) + (1 - P e ) ■ I x=0 , 



where I p equals or 1 depending on proposition p being false or true, respectively. 

The algorithm then consists in doing a depth-first (bottom-up) traversal of all 
the edges, so that each time an edge e is visited, the values of P e and f e {x), for all 
x G {A e , A e + 1, . . . , L}, are calculated using the following recursions. We may then 
use the P e and f e (x) values of the root edge to calculate the distribution of ipr- 

Recursion for / e (x). 

• If e leads into leaf i, then 



f e (x) ~ P^ e 



x, Y e = 1]. 



fe(x) 



P[^ e = A e , Y e = l]-4 =Ae 



Pi ■ Ix=\ e ■ 
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• If e leads into the tail of edges c and d, then 

x — Ae-Arf 

(14) / e (z) = fc(i)-fd(x-X e -i) + (l-P d )-f c (x-X e ) + (l-P c )-f d (x-X e ). 

i=\ c 

Note that whenever the term f c (x — A e ) with x — A e < A c or the term fd(x — X e ) 
with x — X e < Xd is used in Equation (fT4|) . the algorithm will assume that its value 
is and that therefore there is no need to calculate and store f e {x) for x outside 
the range {A e , A e + 1, . . . , L}. 

Equation (fl4f is easily proved; we have 

f e (x) = P[^ e = x, Y c = 1, Y d = l}+ F[<p e =x,Y c = 1, Y d = 0] + Pfo> e = x, Y c = 0, Y d = 1] 

= ¥[<p c + <p d = x-X e ,Y c = l,Y d =l] 

+ P[(p c = x-X e ,Y c = 1, Y d = 0] + ¥[(p d = x-X e ,Y c = 0, Y d = 1] 

where the second equality is obtained by restating event <p e = x in terms of ip c and 
ip d , which is possible once we make assumptions on Y c and Y d . Thus, 

x — A e 

f e (x) = ^F[<p c = i,Y c = l]--p[<p d = x-Xe-i,Yd = l] + 

W[<p c = x - X e , Y c = 1] ■ V[Y d = 0] + ¥[<p d =x-X e ,Y d = l]- ¥[Y C = 0] 

= Hi)-fd{x-X e -i) + {l-P d )-f c {x-X e ) + {l-P c )-f d {x-X e ). 

i=\ c 

where the first equality is obtained by using the independence between the survival 
events in C(c) and C(d). Note that in the first expression in the second equality, 
the range of the sum has been reduced, as f c (i) = for i < X c and f d {x — X e — i) = 
for x — A e — i < X d . 

Recursion for P e . 

• If e leads into leaf z, then P e = Pi- 

• If e leads into the tail of edges c and d, then P e = P c + P d — P c P d . 

Computational complexity. For any given e, the calculation of P e is done in 
O(l) time, whereas that of each of the f e (x) values requires 0(x) — O(L) time (see 
recursion (|14ll ). giving a total of 0(L 2 ). Calling n the number of leaves in T, there 
are 2n — 1 edges in T and the entire procedure takes 0(nL 2 ) time. 

A more efficient version of the algorithm can be obtained by restricting the 
calculation of f e (x) to the values of x G {A e ,A e + l,...,L e }, where L e is the 
maximum value that <p e can attain (namely the sum of the lengths of all the edges 
separated from the root by e, including e itself). Note that the sum in (fH)) can 
then be further restricted to the values of i such that i < L c and x — A e — i < L d . 
Using this more efficient algorithm, it is easy to see that the calculation of all the 
fe(x) values for a given internal edge e takes 0(L c L d + L e ) time, where c and d 
are the edges that e leads into. Noting that the sum of all the L c L d terms, for all 
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sister edges c and d, is bounded above by L 2 , this shows that the running time of 
the entire procedure is 0(L 2 + nL). Since typically every pair of taxa in the tree 
is separated by at least one edge of positive length, we have that n = 0(L) and 
therefore the running time above is equivalent to 0(L 2 ). 

Regarding memory requirements, note that each time we calculate the informa- 
tion relative to e (namely P e and f e (x)), the information relative to the edges it 
leads to (if any) can be deleted, as it will never be used again. So, at any given 
moment the information of at most n 'active' edges needs to be stored. If we use the 
range restriction just described, the sizes of the f e {x) vectors for all the active edges 
sum to a number bounded above by n + L, and therefore the algorithm requires 
0(n + L) space, equivalent to 0{L) if n — O(L). 

4. Extension to unrooted PD 

There is a simple modification of the definition of phylogenetic diversity that 
is also relevant in biology ([5], [10]). Given a subset X' of X, we can evaluate 
the sum of the lengths of the edges in the minimum subtree connecting (only) the 
leaves in X' . This score - which we will denote by uPD(X') and refer to as the 
'unrooted PD' score of X 1 - is equivalent to PD(X') if the path connecting two 
leaves in X' traverses the root of T. However, in general, uPD(X') < PD(X') 
(Fig. H shows an example where uPD(X') < PD(X')). This alternative concept of 




Figure 4. If only the taxa marked * in the tree on the left sur- 
vive then future 'unrooted' phylogenetic diversity is the sum of the 
lengths of the solid edges in the tree on the right. Notice that the 
uPD value in this example is less than the PD value (c.f. Fig. 1). 



phylogenetic diversity has the advantage that it can be defined on either rooted or 
unrooted phylogenetic trees. Of course, the g-FOB model is also defined naturally 
on unrooted trees, and so it makes sense to consider the distribution of uPD under 
the g-FOB model in this more general setting. A natural question is whether 
Theorem 12. II is still valid, (that is, is the future uPD of (rooted or unrooted) trees 
also asymptotically normal under conditions (CI) and (C2)?). We now answer this 
question (affirmatively) and also show how to extend the computation of the exact 
future PD distribution to unrooted trees. 

Let the random variable ip' — (pij- denote the uPD score of the random subset 
X' of X (consisting of those taxa that will still exist at some time t in the future). 
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We call (p' the future unrooted phylogenetic diversity. In this model, we have 

(is) ^' = ^A e r e ', 

e 

where Y e is the binary random variable which takes the value 1 if e lies on at least 
one path between some pair of taxa in X' ', and which is otherwise. Moreover, 

(16) p[y e '=i]=(i- n (! -pox 1 - n a-Pi)), 

ieA-i(e) ieV 2 (e) 

where ^i(e) and -^(e) are the bipartition of X consisting of the two subsets of 
X that are separated by edge e. Thus if we let Pi{e) denote the probability that 
at least one taxon in X;(e) survives (for i € {1, 2}), then the expected value of <p' 
(analogous to ((3|)) is 

(17) E[^]=^A e P 1 (e)P 2 (e). 

e 

Regarding Var [<£>'] there is an analogous formula to that given in Lemma 1 1.1 1 

Consider now a sequence T X ,T 2 , ... ,T n , ... of (rooted or unrooted) phylogenetic 
trees where T n has n leaves, and assume that this sequence satisfies conditions (CI) 
and (C2) when each T n has associated edge length and leaf survival probability 
functions. It can be shown that Theorem 12.11 is still valid for uPD; that is, under 
the same conditions, (ip' n — E[<^])/yVar[^J converges in distribution to iV(0, 1) 
as n — > oo. 

To establish this asymptotic normality of ip' n under conditions (CI) and (C2*) 
(and thereby (CI) and (C2)) requires slight modifications to the proof of Theo- 
rem [2T] and we now provide an outline of the argument. The main difference is 
that now each edge e induces a bipartition X = X\{e) U X 2 (e) of the taxon set 
and so we decompose T n in a slightly different way. For simplicity, assume that 
\X\ (e) | < |X 2 (e)| and consider the following edge sets (the definition of the function 
f(n) is as in the rooted case): 

E%: edges e of T„ with |Xi(e)| < f{n). 
££: edges e of T n with \X x {e)\ > f (n). 

Ef 2 - edges e € E™ such that e is adjacent to an edge / € E 2 ■ 
For ip' n we obtain the following equation: 



(is) v' n =J2 X - Y - + E x ? Y e = E X - Y e + A « - K> 

eeE™ e£-EJ eEE? 

where A„ = 2 eeE n A e and R' n = J2 e eE? -M 1 ~ Y D- For an ed g e e e E in let *e 
denote the subtree with root edge e and with leaf set X\{e). Let (</?")' denote the 
contribution to tp' n by the edges in t e . Furthermore, let ip™ be the rooted future 
phylogenetic diversity of t e , Z n = J2eeE™ 2 <Pe as m * ne rooted case, W e — <p™ — (ip™)' 
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and V n — J2eeEY 2 With this notation, we get 

(19) <p' n = W+Xn-K = E E W e+Xn-K = Z n -V n +\ n -R' n 

Now we can apply Lemma 12.41 and Slutsky's Theorem to complete the proof. 



4.1. Computing the uPD distribution. Finally we show how the algorithm de- 
scribed in Section [3] for computing the PD distribution can be modified to calculate 
the distribution of unrooted PD. As before, we assume the edge lengths are non- 
negative integers and we preprocess T (possibly rooting it in an arbitrary vertex) so 
that the number of outgoing edges is 1 for the root and 2 for all the other non-leaf 
vertices. Since T is now rooted, C(e) and the random variables Y e are well defined. 
We also define ip' e as the uPD of the surviving taxa in C(e). Then, for any integer 
x, define 

f' e (x) :=Wy e = x,Y e = l\. 

As before, if e is the root edge of T, then f' e and P e are sufficient to derive the 
distribution of ip' T : 

F[ip' T = x}=f' e (x) + (l-P e )-I x = . 

An algorithm to calculate the distribution of ifi' T can be obtained with a simple 
modification of the algorithm for (p-r- for each edge e, in addition to calculating 
P e and f e (x), also calculate f' e {x), for all x <E {0, 1, . . . , L}. For this purpose, the 
following recursion is used (note that the f' e values may depend on f c and fd as 
well as on f' c and f d , which is why we retain the calculation of the f e values even 
though they are not directly implicated in determining V[ip' T = x]). 



Recursion for f' e (x). 

• If e leads into leaf i, then 

f' e (x) = Pi ■ I x=0 . 

• If e leads into the tail of edges c and d, then 

(20) f e {x) - ■ f^ x - + (1 " Pd) ■ fci?) + (1 - Pc) ■ f' d {x), 

i=X c 

which is proved in a way similar to (|14p: 

f' e (x) = F[ip' e = Xl Y c = 1, Y d = 1] + F[<p' e =x,Y c = 1, Y d = 0] + ¥[ V ' e = x,Y c = 0, Y d = 1] 

= P[^ c + ip d = x, Y c = 1, Y d = l}+ Fy c =x,Y c = 1, Y d = 0] + F[y d = x, Y c = 0, Y d = 1] 

= E ■ fd(x - i) + (1 - P d ) ■ f' c (x) + (1 - P c ) • f' d (x). 

l=Xa 
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5. Concluding remarks 

The main result of this paper (Theorem 12. ip has been to establish a limiting 
normal distribution for future PD on large phylogenetic trees. This theorem as- 
sumes an underlying generalized 'field of bullets' model, and imposes two further 
mild conditions (conditions (CI) and (C2)). In this setting Theorem 12.11 reduces 
the problem of computing the distribution of future PD to that of determining 
just two parameters - its mean and variance - and these can be readily computed 
by Equation ([3]) and Lemma fl. II Using the resulting normal distribution one can 
easily compute the probability under the model that future PD will fall below any 
given critical value. This may also be helpful in designing strategies to minimize 
this probability (analogous to the 'Noah's Ark problem' which tries to maximize 
expected future PD). 

In practice, the use of a normal distribution based on Theorem 12 . 1 1 req uires that 
the number of taxa is moderate (> 50), that the survival probabilities are not too 
extreme (condition (CI)), and that the length of the pendant edges on average are 
not too small in relation to the largest edge length in the tree (condition (C2)). 
If these conditions are violated, it would be prudent to use the exact algorithm 
we have described in the paper, as this requires neither a large number of taxa nor 
condition (CI) or (C2). To apply this algorithm may involve some small adjustment 
to the edge lengths to make them integral multiples of some common value. 

Regarding the 'mild conditions' for Theorem 12.11 (namely (CI) and (C2)), we 
showed that neither can be dropped completely from the statement of the theorem. 
However it is likely that both conditions could be weakened somewhat, though at 
the risk of complicating the description of the conditions and the proof of Theorem 
1231 

It would be interesting to explore other extinction models that weaken the strong 
assumption in the g-FOB model that taxon extinction events are independent. One 
such model would regard extinction as a continuous-time Markov process in which 
the extinction rate of a taxon i at any given time t is the product of an intrinsic 
extinction rate r% , with a factor that depends on the set of species in the tree that 
are extant at time t. In general, such processes could be very complex, so a first 
step would be to identify a simple model that nevertheless captures more biological 
realism than the g-FOB model. 



6. Appendix 

Condition (C2) is satisfied in expectation for trees generated under a continuous- 
time pure-birth model. 

Consider a model where each lineages persist for a random period of time before 
speciating, and that these persistence times are i.i.d. random variables with ex- 
ponential distribution with mean s > (This model is sometimes called the 'Yule 
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model' in phylogenetics). Now suppose we sample the process at some time dur- 
ing which the tree has n leaves. Let T n denote this tree (with its associated edge 
lengths) and let fip(n) denote the average length of the pendant edges of T n . 

Proposition 6.1. Under a constant birth model with speciation rate s and [3 € 
(0, 1) there is a constant B > for which the expected value of 

J2 {x^) 2 -Bn^L(nf 

e£E P (T n ) 

is strictly positive for all n > 3. 

Proof. By using the inequality Yl7=i x i — «(EILi x i) 2 we have 

E[ Yl ( A i™ } ) 2 - Bn^L(n) 2 } > nE[u P (n)] 2 - Bn^E[L(n) 2 } 

e£E P (T n ) 

and the proposition follows (by choice of a sufficiently small value of B > 0) once 
we establish the following two results. 

(i) E[np(n)] > s/6 for all n > 3. 

(ii) n~ v E[L(n) 2 ] —> as n — > oo, for any n > 0. 

To establish result (i), let S n denote the sum of the lengths of the pendant edges 
of T n up till the point when the number of species first changes from n — 1 to 
n, and excluding the (length of the) pendant edge on which this speciation event 
occurs. Thus S n is a sum of lengths of n — 2 pendant edges. [For example, S3 
has an exponential distribution with mean s/2, as it is the length of the edge that 
does not first speciate, up until the time when one of the two edges in the tree first 
speciates] . Since we are observing the tree T n at some later time (but while it still 
has n leaves) then we clearly have: 

(21) np(n) > -S n . 

n 

We will derive a recursion for the sequence (E[S n ],n = 3,4,...). Let 9 n be an 
exponentially-distributed random variable with mean s/n. Now, the random vari- 
able SV1+1 takes the value S n + (n — l)6 n , with probability 2/n (this is the case 
where the next speciation event occurs on one of the two edges that develop from 
the last speciation event). Otherwise (and so with probability 1 — 2/n), S n +i takes 
the value S n + (n — l)0 n — A e , where A e is the length of one of the n — 2 pendant 
edges that contribute to S n (selected uniformly at random from this set of edges). 

Consequently, 

E[S n+1 ] = ^(E[5 n ] + (n-l)-) + (l--)(E[S„](l--^-5) + (n-l)-) 

n n n n — 2 n 

11 — 1 

= (E[5„] + ,s). 

n 

By using the initial condition E[Ss] — s/2, and this recursion, we have that E[S n ] = 
ns/2 — s for all n > 3. Taking expectations on both sides of inequality (|2Tj) gives 

E[^ P (n)]>-(ns/2-s)= S --->^ 
n 2 n 6 
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for all n > 3, thus proving (i). 

To establish result (ii), observe that length of the longest edge in T n (namely 
L(n)) is bounded above by the length of the longest edge in the tree obtained from 
T n by allowing each leaf to evolve until it next speciates. Now, the lengths of the 
edges in this resulting trees is a set of |£J(7^)| independent random variables each 
having an exponential distribution with mean s (here |i5(7^)| is the number of edges 
of T n , which is at most 2n — 1). Thus, if we let Y n be the maximum of 2n — 1 i.i.d 
exponentially-distributed random variables, each with mean s, then L(n) < Y n . 
Moreover, for any x > we have: 

/•OC /"OO 

(22) E[y„ 2 ] = / P[y„ 2 > y]dy < x 2 + P[Y 2 > y]dy, 

JO Jx 2 

where the first equality in l|22[) is a standard identity in probability theory for any 
non-negative random variable Y 2 . Now, by Boole's inequality, 

P[F„ 2 > y] = F[Y n > y/y\ < (2n - 1) exp(-^y/a). 



Making the substitution y = x 2 + t 2 , and applying the inequality \/ x 2 + 1 2 > 
we obtain 

r°° x f°° t 

/ F[Y 2 >y]dy <2(2n-l)exp( ■=) / texp( ■=)&. 

Jx 2 sy2 Jt=0 Sy/Z 

Thus, taking x = cslog(n), for any c > in (|22|) we obtain 
E[L(n) 2 ] < E[Y 2 } < cV(log(n)) 2 + o(l), 
(where o(l) is a term that tends to as n — > oo) and result (ii) now follows. □ 
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